fs.http: prevent hangs under some network conditions #7460
Conversation
@dtrifiro Is it possible to show the difference before and after? Possibly showing the output from each, a video, or a reproducible script?
@@ -114,7 +120,7 @@ async def get_client(self, **kwargs):
             total=None,
             connect=self.REQUEST_TIMEOUT,
             sock_connect=self.REQUEST_TIMEOUT,
-            sock_read=None,
+            sock_read=self.REQUEST_TIMEOUT,
I'm assuming that setting `sock_read=self.REQUEST_TIMEOUT` (60s) will not create any issues: the default buffer size is 2**16, see `ClientSession._request`: https://github.com/aio-libs/aiohttp/blob/f5ff95efe278c470a2ff65cabbb5f5f08ba07416/aiohttp/client.py#L511-L519
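For anyone following along, here's a minimal sketch of what the resulting timeout configuration looks like (illustrative, not the exact dvc code; `REQUEST_TIMEOUT = 60` is assumed from the discussion above):

```python
import aiohttp

REQUEST_TIMEOUT = 60  # seconds, assumed from the discussion above


async def get_session() -> aiohttp.ClientSession:
    timeout = aiohttp.ClientTimeout(
        total=None,                    # no overall deadline; transfers can be large
        connect=REQUEST_TIMEOUT,       # max time to acquire a connection from the pool
        sock_connect=REQUEST_TIMEOUT,  # max time to establish the TCP connection
        sock_read=REQUEST_TIMEOUT,     # max gap between socket reads (the new setting)
    )
    return aiohttp.ClientSession(timeout=timeout)
```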
For the record, here are a few examples of failing CI (timeout) on
Here are a few examples of succeeding CI (running this branch):
Amazing research!
+1, amazing work 🔥
dvc/fs/http.py (Outdated)
client_kwargs["connector"] = aiohttp.TCPConnector(
    enable_cleanup_closed=True
)
Is there more detail on this? I'd like to see why this is not the default in `aiohttp`.
Looks like aiohttp used to do this by default 5 years back, but it was changed into an option by this commit: aio-libs/aiohttp@02b0951. I could not find any reasoning for the change, though.
From https://docs.aiohttp.org/en/stable/client_reference.html#tcpconnector:
Some ssl servers do not properly complete SSL shutdown process, in that case asyncio leaks SSL connections. If this parameter is set to True, aiohttp additionally aborts underlining transport after 2 seconds. It is off by default.
Given that description of how it works, it seems reasonable that this isn't the default.
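To make the quoted behavior concrete, here's an illustrative sketch of the mechanism as I read it from the docs (my own reading, not aiohttp's actual internals): transports whose SSL shutdown never completes get forcefully aborted after the 2-second grace period.

```python
# Illustrative only: my reading of the docs above, not aiohttp's real internals.
import asyncio


async def abort_stuck_transports(pending: list[asyncio.Transport]) -> None:
    """Forcefully abort transports whose SSL shutdown never completed."""
    await asyncio.sleep(2)  # the 2-second grace period mentioned in the docs
    for transport in pending:
        transport.abort()  # close immediately, skipping the SSL shutdown handshake
```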
I, too, was wondering why this is not the default, but digging around I could not find any reasoning either. I guess this only becomes an issue under particular network conditions (broken TLS connections?), which were encountered often in dvc-bench due to our large number of concurrent requests.
It looks like this could be the context for the change: aio-libs/aiohttp#1767.
Edit to add some more explanation: the discussion is right around the time of the change, the contributor is in the discussion, and they discuss the related (now outdated) attribute `cleanup_closed_disabled`.
I guess we could also ask 😄
Force-pushed 1745f16 to 12de7e7
Added some context comments with the last force push
@@ -114,7 +123,7 @@ async def get_client(self, **kwargs):
             total=None,
             connect=self.REQUEST_TIMEOUT,
             sock_connect=self.REQUEST_TIMEOUT,
-            sock_read=None,
+            sock_read=self.REQUEST_TIMEOUT,
I am curious to see if this alone will solve the problem in dvc-bench, and a bit hesitant to add `TCPConnector` without trying out this solution first. Would you mind opening another PR with just this line change? We can merge that and keep this PR open for 1-2 days to monitor.
I've tried it; it solves the issues on my machine ™️ (macOS), but it still freezes on the dvc-bench CI.
Gonna test out a few extra scenarios before merging this.
We have been getting reports that the timeout on sock_read was raising timeout errors even for chunked uploads, and sometimes even when uploading zero-byte files. See: https://github.com/iterative/dvc/issues/8065 and iterative/dvc#8100. This kind of logic doesn't belong here and should be upstreamed (e.g. RetryClient/ClientTimeout, etc.). We added the timeout in iterative/dvc#7460 because of freezes in iterative/dvc#7414. I think we can roll this back for now, given that there are lots of reports of failures/issues with this line; if we get any new reports of hangs, we'll investigate them separately.
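A sketch of what the rollback amounts to (mirroring the diff earlier in this thread, not the exact patch; `REQUEST_TIMEOUT` is an assumed constant):

```python
import aiohttp

REQUEST_TIMEOUT = 60  # assumed, as in the original PR

# Rolled back: sock_read=None again, since the 60s read timeout was
# reported to fire on slow chunked uploads and even zero-byte files.
timeout = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=None,
)
```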
When using an https remote (such as in the dvc-bench remote, see iterative/dvc-bench#319), under certain situations `dvc pull` would freeze and never return. Note that this did not happen 100% of the time, but often enough to make the bench CI always fail (note: CI was running a matrix strategy over ~12 workers, pulling a 25k-file dataset on each).

Investigation pointed to network issues (as the issue could not be reproduced locally). I ended up tracking this down to a connection being dropped/lost and `aiohttp` not realizing this. Due to our timeout defaults for `aiohttp.ClientSession` (note that `sock_read=None`), `aiohttp` would keep trying to read from a dead socket and thus hang.

By instantiating an `aiohttp.TCPConnector` with `enable_cleanup_closed=True`, the underlying transport is forcefully closed when the connection is lost, so that the request fails and `RetryClient` retries.

Thanks @pared and @karajan1001
fixes #7414
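For reference, a minimal sketch of the overall client setup described above (assuming `aiohttp_retry`'s `RetryClient`, which the description names; the retry parameters are illustrative, not dvc's actual values):

```python
import aiohttp
from aiohttp_retry import ExponentialRetry, RetryClient

REQUEST_TIMEOUT = 60  # illustrative value

# Forcefully abort transports when a connection is lost mid-SSL-shutdown,
# so a dead connection surfaces as an error instead of a silent hang.
connector = aiohttp.TCPConnector(enable_cleanup_closed=True)

timeout = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=REQUEST_TIMEOUT,
)

# RetryClient wraps aiohttp.ClientSession: once the aborted transport makes
# the request fail, it is retried instead of freezing `dvc pull`.
client = RetryClient(
    connector=connector,
    timeout=timeout,
    retry_options=ExponentialRetry(attempts=5),
)
```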